Commonly, the solution to a system of linear equations $Ax = b$ is expressed in the form $x = A^{-1}b$, where $A$ is a square matrix and $x$ and $b$ are vectors. In most numerical programming environments (e.g. numpy, Julia, MATLAB), this expression can be literally evaluated as something like this:
x = inv(A) * b
or x = numpy.linalg.inv(A) @ b
One should resist the temptation to translate formulae literally. It is almost always superior to write
x = A \ b
or x = numpy.linalg.solve(A, b)
because it is both faster and usually more accurate on poorly-conditioned systems.
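As a rough illustration of the accuracy point, here is a minimal NumPy sketch; the Hilbert matrix and its size are arbitrary choices of a poorly-conditioned example:

```python
import numpy as np

# Hilbert matrix: a classic, badly-conditioned test case.
n = 12
A = 1.0 / (np.arange(n)[:, None] + np.arange(n)[None, :] + 1)
x_true = np.ones(n)
b = A @ x_true

x_inv = np.linalg.inv(A) @ b      # literal translation of x = A^{-1} b
x_solve = np.linalg.solve(A, b)   # factorization-based solve

# The direct solve typically leaves a much smaller residual.
print(np.linalg.norm(A @ x_inv - b))
print(np.linalg.norm(A @ x_solve - b))
```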
This MATLAB documentation explains further how these are different and why it is almost always better to use the \ or solve operations. If solving multiple systems is required, then the matrix can be factorized via LU (any non-singular matrix), Cholesky (hermitian positive definite), or LDLᵀ (hermitian) decompositions. Solving linear systems after such factorizations is a highly-optimized operation in the BLAS/LAPACK libraries underlying environments like Julia, numpy, and MATLAB. Inverting usually requires one of these factorizations internally, in addition to forming the inverse from said factorization. Factorizing first, then directly solving the system of equations doesn't require the inverse, reducing computation time and numerical error.
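Here is a minimal sketch of the factorize-once pattern, assuming SciPy is available (the sizes and random data are arbitrary):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve, cho_factor, cho_solve

rng = np.random.default_rng(0)
n = 500
A = rng.standard_normal((n, n))   # a general (almost surely non-singular) matrix
B = rng.standard_normal((n, 20))  # 20 right-hand sides

# LU: factorize once (O(n^3)), then each solve is only O(n^2) per column.
lu, piv = lu_factor(A)
X = lu_solve((lu, piv), B)

# Cholesky for a Hermitian positive definite matrix.
S = A @ A.T + n * np.eye(n)       # construct an SPD matrix for the example
c, low = cho_factor(S)
Y = cho_solve((c, low), B)
```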
Some formulae require turning a vector into a diagonal matrix so as to do elementwise scaling on a vector, or row-wise/column-wise scaling on a matrix. A common example, used below, is reconstructing a matrix from its singular value decomposition, $A = U \operatorname{diag}(s) V^T$.
NumPy, Julia, and MATLAB provide a function diag that turns a vector $d$ of length $n$ into a diagonal matrix $D$ of size $n \times n$, where $D_{ii} = d_i$ and all off-diagonal entries are zero. However, this function should almost never be necessary, because all three of the above environments provide broadcasting functionality that has the same effect at much lower computational and storage cost.
For example, take the following expression:
A = U * diag(s) * Vt
For a matrix A with rank 10,000, diag(s) requires storing $10{,}000^2 = 100{,}000{,}000$ elements, of which 99,990,000 are zero. Furthermore, assuming U and diag(s) are $10{,}000 \times 10{,}000$ matrices, the first matrix multiplication requires on the order of $10{,}000^3 = 10^{12}$ floating-point operations (more than a trillion multiplies + adds).
In Julia, the call to diag(s) can be replaced by Diagonal(s) with no changes to the rest of the code, as in A = U * Diagonal(s) * Vt.
This only requires storing 10,000 elements, and the first matrix multiplication only requires on the order of $10{,}000^2 = 10^8$ floating-point operations. This works because right-multiplication by a diagonal matrix ($U D$, with $D = \operatorname{diag}(s)$) is equivalent to scaling the columns of $U$ by the elements of $s$, and left-multiplication ($D V^T$) works the same except it scales the rows of $V^T$.
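In index notation, these two identities read:

$$(U D)_{ij} = U_{ij}\, s_j, \qquad (D V^T)_{ij} = s_i\, (V^T)_{ij}, \qquad D = \operatorname{diag}(s).$$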
In numpy, the more efficient solution uses broadcasting, which has the same effect as the Julia version but isn't a simple drop-in replacement. Right-multiplying a matrix by a diagonal matrix (represented as a vector) can be done as follows:
A = (U * s) @ Vt
or A = (U * s[None, :]) @ Vt if you want to be more explicit about the broadcast.
Left-multiplying by a diagonal can be done like so: A = U @ (s[:, None] * Vt)
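A quick way to check that these broadcasts match the explicit diag version (the sizes here are arbitrary, kept small so that forming np.diag is still affordable):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
U = rng.standard_normal((n, n))
s = rng.standard_normal(n)
Vt = rng.standard_normal((n, n))

A_dense = U @ np.diag(s) @ Vt     # forms the full n x n diagonal matrix
A_bcast = (U * s) @ Vt            # broadcasting: scales the columns of U

print(np.allclose(A_dense, A_bcast))  # True, up to floating-point round-off
```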
The Gram matrix of a matrix $A$ is given by $A^T A$. The Gram matrix and its scaled variants (such as the covariance matrix) represent a third type of matrix that should (almost) never be formed. The reason for this is that there are two main applications for the Gram matrix: using it in a system of equations, or finding its eigen-system.
The Gram matrix is usually involved in a system of equations when the objective is to find a least-squares approximation. This system (the normal equations) usually looks like $A^T A\, x = A^T y$.
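For reference, this is the stationarity condition of the least-squares objective:

$$\min_x \|Ax - y\|_2^2, \qquad \nabla_x \|Ax - y\|_2^2 = 2 A^T (Ax - y) = 0 \;\Longrightarrow\; A^T A\, x = A^T y.$$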
While the system can be solved by explicitly computing cholfact(A'A) \ (A'y) (Julia syntax), this only works when the Gram matrix is full rank (i.e. A has full column rank). Even then, a QR decomposition of A is more numerically stable on poorly-conditioned systems, since forming A'A squares the condition number. The QR factorization is also the default method behind the proper solution in this case, which is simply A \ y.
Additionally, when the Gram matrix is not full rank, the Cholesky factorization is impossible. In this case, the backslash operator (or numpy.linalg.lstsq in NumPy) still returns a least-squares solution by falling back to a rank-revealing factorization (pivoted QR, or the SVD in lstsq's case), while the explicit Gram-matrix solution will just throw an error.
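To make the comparison concrete, here is a small NumPy sketch (the random data and sizes are arbitrary); np.linalg.lstsq plays the role of the backslash operator here:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 1000, 50
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)

# Normal-equations approach: forms the Gram matrix explicitly.
x_gram = np.linalg.solve(A.T @ A, A.T @ y)

# Preferred approach: an SVD-based least-squares solver.
x_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

print(np.allclose(x_gram, x_lstsq))  # True here, but the Gram route breaks
                                     # down as A becomes ill-conditioned or
                                     # rank-deficient.
```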
The other main reason the Gram matrix is commonly formed is to find its eigen-system for procedures like Principal Components Analysis (PCA) or data whitening. However, these can be more easily and stably computed via the Singular Value Decomposition. Here’s how:
Let $A^T A = V \Lambda V^T$ be the eigendecomposition of the Gram matrix, with the eigenvectors in the columns of $V$ and the eigenvalues on the diagonal of $\Lambda$. However, $V$ and $\Lambda$ can also be obtained through the singular value decomposition of A, $A = U \Sigma V^T$: the right singular vectors are exactly those eigenvectors, and $\Lambda = \Sigma^2$ (the eigenvalues are the squared singular values).
The SVD method not only avoids computing the Gram matrix, but also avoids wasting time on singular values that are exactly 0, something that the eigenvalue solver cannot avoid as easily.
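As an illustration of this equivalence, here is a minimal NumPy sketch (the centering step and data shape are assumptions for a PCA-style setting):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))   # 200 samples, 10 features
Xc = X - X.mean(axis=0)              # center the data for PCA

# Gram-matrix route: eigendecomposition of Xc^T Xc.
evals, evecs = np.linalg.eigh(Xc.T @ Xc)

# SVD route: no Gram matrix is ever formed.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# The squared singular values are the eigenvalues of Xc^T Xc (up to ordering),
# and the rows of Vt are the corresponding eigenvectors (up to sign).
print(np.allclose(np.sort(s**2), np.sort(evals)))   # True
```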
Three common matrices in linear algebra computing never need to be formed: the inverse, the full diagonal matrix, and the Gram matrix. While it can be tempting to literally translate from formulae to code, there are usually highly optimized methods for solving these common problems. Furthermore, even though it is true that one should avoid premature optimization, these optimizations come for free: they reduce compute cost and often improve readability as well.